{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "authorship_tag": "ABX9TyPksI6bKWFGhgdid32t9EoK",
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/sugarforever/LangChain-Tutorials/blob/main/LangChain_ParentDocumentRetriever.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# LangChain Parent Document Retriever 简介\n",
        "\n",
        "当为检索而切分文档时，通常在切分大小的选择上存在困惑：\n",
        "\n",
        "- 您可能希望拥有较小的文档切片，以便它们的嵌入可以更准确地反映它们的含义。如果太长，嵌入可能会失去意义。\n",
        "- 您可能又希望有足够长的文档，以保留更完整的上下文。\n",
        "\n",
        "`ParentDocumentRetriever` 通过分级切分和存储文档数据块来取得平衡。在检索过程中，首先获取小块，然后查找这些块的父文档切片，并返回这些较大的文档片段。"
      ],
      "metadata": {
        "id": "qen1idqrUoSv"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import os\n",
        "\n",
        "os.environ['OPENAI_API_KEY'] = '您的有效openai api key'"
      ],
      "metadata": {
        "id": "S204DlN5VZ65"
      },
      "execution_count": 1,
      "outputs": []
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "jQSD_q4Qmthb",
        "outputId": "26cc5f1f-6f4e-4de7-8da1-62893d49d2c2"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.5/1.5 MB\u001b[0m \u001b[31m11.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m73.6/73.6 kB\u001b[0m \u001b[31m4.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m405.5/405.5 kB\u001b[0m \u001b[31m30.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m90.0/90.0 kB\u001b[0m \u001b[31m9.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m3.1/3.1 MB\u001b[0m \u001b[31m64.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25h  Installing build dependencies ... \u001b[?25l\u001b[?25hdone\n",
            "  Getting requirements to build wheel ... \u001b[?25l\u001b[?25hdone\n",
            "  Preparing metadata (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m58.4/58.4 kB\u001b[0m \u001b[31m6.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m59.5/59.5 kB\u001b[0m \u001b[31m6.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.3/5.3 MB\u001b[0m \u001b[31m92.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.9/5.9 MB\u001b[0m \u001b[31m80.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m7.8/7.8 MB\u001b[0m \u001b[31m86.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m67.3/67.3 kB\u001b[0m \u001b[31m5.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25h  Installing build dependencies ... \u001b[?25l\u001b[?25hdone\n",
            "  Getting requirements to build wheel ... \u001b[?25l\u001b[?25hdone\n",
            "  Preparing metadata (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m49.4/49.4 kB\u001b[0m \u001b[31m5.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m67.0/67.0 kB\u001b[0m \u001b[31m6.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m46.0/46.0 kB\u001b[0m \u001b[31m4.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m58.3/58.3 kB\u001b[0m \u001b[31m6.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m428.8/428.8 kB\u001b[0m \u001b[31m25.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.1/4.1 MB\u001b[0m \u001b[31m57.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.3/1.3 MB\u001b[0m \u001b[31m46.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m129.9/129.9 kB\u001b[0m \u001b[31m12.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m86.8/86.8 kB\u001b[0m \u001b[31m7.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25h  Building wheel for chroma-hnswlib (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n",
            "  Building wheel for pypika (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n"
          ]
        }
      ],
      "source": [
        "!pip install -q -U langchain openai chromadb"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 准备一个PDF文档"
      ],
      "metadata": {
        "id": "cXjxJkwtVr8F"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!wget https://developer.apple.com/carplay/documentation/CarPlay-App-Programming-Guide.pdf"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "aO2C6on5nETy",
        "outputId": "699793d7-7e91-4226-d098-115de87ab736"
      },
      "execution_count": 3,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "--2023-08-15 23:02:42--  https://developer.apple.com/carplay/documentation/CarPlay-App-Programming-Guide.pdf\n",
            "Resolving developer.apple.com (developer.apple.com)... 17.253.21.203, 17.253.21.201, 2620:149:a10:f000::5, ...\n",
            "Connecting to developer.apple.com (developer.apple.com)|17.253.21.203|:443... connected.\n",
            "HTTP request sent, awaiting response... 200 OK\n",
            "Length: 7694976 (7.3M) [application/pdf]\n",
            "Saving to: ‘CarPlay-App-Programming-Guide.pdf’\n",
            "\n",
            "\r          CarPlay-A   0%[                    ]       0  --.-KB/s               \rCarPlay-App-Program 100%[===================>]   7.34M  --.-KB/s    in 0.09s   \n",
            "\n",
            "2023-08-15 23:02:42 (83.6 MB/s) - ‘CarPlay-App-Programming-Guide.pdf’ saved [7694976/7694976]\n",
            "\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install -q pymupdf"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "LN9xxra6oJc5",
        "outputId": "1a1e00d4-ed1d-4860-dc81-a4a3eda4f451"
      },
      "execution_count": 4,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m14.1/14.1 MB\u001b[0m \u001b[31m71.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25h"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "from langchain.retrievers import ParentDocumentRetriever\n",
        "from langchain.vectorstores import Chroma\n",
        "from langchain.embeddings import OpenAIEmbeddings\n",
        "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
        "from langchain.storage import InMemoryStore\n",
        "from langchain.document_loaders import PyMuPDFLoader"
      ],
      "metadata": {
        "id": "_DjYzTzdnHrg"
      },
      "execution_count": 6,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "loader = PyMuPDFLoader('./CarPlay-App-Programming-Guide.pdf')\n",
        "docs = loader.load()"
      ],
      "metadata": {
        "id": "VvP12aIInMbu"
      },
      "execution_count": 7,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "len(docs)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "nG55bSwPoOe8",
        "outputId": "2fc7f3ba-1db2-48fb-98f2-ce78accb4ff7"
      },
      "execution_count": 8,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "54"
            ]
          },
          "metadata": {},
          "execution_count": 8
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## ParentDocumentRetriever支持的参数\n",
        "\n",
        "支持的参数中，值得关注的是 `parent_splitter` 和 `child_splitter`。它们分别指定父文档拆分器和子文档拆分器。"
      ],
      "metadata": {
        "id": "GuenXYT2VwHl"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### 不指定 parent_splitter\n",
        "\n",
        "这时，文档不会进行父子两级拆分。原始文档即父文档。\n",
        "\n",
        "父文档存储在 `InMemoryStore` 中，子文档的嵌入数据被存储在向量存储中。本例中我们使用了 `Chromadb`。"
      ],
      "metadata": {
        "id": "DVPCv3hyWMvN"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)\n",
        "vectorstore = Chroma(\n",
        "    collection_name=\"full_documents\",\n",
        "    embedding_function=OpenAIEmbeddings()\n",
        ")\n",
        "# The storage layer for the parent documents\n",
        "store = InMemoryStore()\n",
        "retriever = ParentDocumentRetriever(\n",
        "    vectorstore=vectorstore,\n",
        "    docstore=store,\n",
        "    child_splitter=child_splitter,\n",
        ")"
      ],
      "metadata": {
        "id": "gwVnlY4MnOV1"
      },
      "execution_count": 9,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install -q tiktoken"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "-k2GEtShriIA",
        "outputId": "5c2a9141-d7ff-4fe1-e9fd-cbb76a033d1e"
      },
      "execution_count": 10,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "\u001b[?25l     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/1.7 MB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m\r\u001b[2K     \u001b[91m━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.1/1.7 MB\u001b[0m \u001b[31m2.7 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r\u001b[2K     \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.1/1.7 MB\u001b[0m \u001b[31m15.7 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.7/1.7 MB\u001b[0m \u001b[31m18.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25h"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "**注意，这里我们即使让 `retriever` 自行生成文档ID，也需要指定参数 `ids` 的值为 `None`。**"
      ],
      "metadata": {
        "id": "slK0BIRTWzjF"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "retriever.add_documents(docs, ids=None)"
      ],
      "metadata": {
        "id": "vktqFn7bnP_v"
      },
      "execution_count": 11,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "list(store.yield_keys())"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Pv7CeAhnnSF6",
        "outputId": "649d630d-2450-45b4-ed4e-003f91b7c375"
      },
      "execution_count": 12,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "['bde53b56-5891-4810-af11-6a7042726ab3',\n",
              " '4946fe18-4418-423b-9ead-58bc0f43f757',\n",
              " '15665906-0cd0-4f2f-b977-a5f5c107fc93',\n",
              " '402e060f-4e13-45c3-95c2-d4e28fff91a0',\n",
              " 'c58b06dd-d710-4ce3-9eda-20b1b3d7cacb',\n",
              " '1eac9574-6dc9-4fb0-828c-36c02a3c762a',\n",
              " 'e17e0a47-b75c-46aa-9e6d-9752b43f4295',\n",
              " '5ebcb89b-8a2a-441b-acba-e96c82f2f42c',\n",
              " '7cdd12f3-2928-4ba9-a4df-5b18b6105a63',\n",
              " 'a64caaaf-2e26-4ef3-8eda-18ccf4f0eec8',\n",
              " '0d0cc9f6-2a00-4141-adb3-dd75049f8eca',\n",
              " 'b27d1666-793c-46c1-b9df-e06f1a8540f6',\n",
              " '84a84e09-b043-490a-a07a-6d2adc1b07eb',\n",
              " 'ceff23ee-1135-4132-85aa-2864bf59c1ed',\n",
              " 'a2b9ef34-03a1-491a-82ed-2a181c008a49',\n",
              " 'abb99674-280c-4bae-a4a8-fac50d329b1f',\n",
              " '75193b94-8d2e-4529-94e5-a00227b74316',\n",
              " '765ed16f-d29b-4b6e-bf3b-b9a4aaccd235',\n",
              " 'de9d138c-f14d-47d2-8dda-b2db6f9c110d',\n",
              " 'a72b0f9c-1afb-45cd-912f-a4ff57840f14',\n",
              " 'd937a765-881c-49df-a0e1-d1689bcedcc8',\n",
              " '504d4182-f2c3-4377-9291-669c4551b668',\n",
              " '79c67194-d1a2-4a1a-8bbf-f26a8cb87aa0',\n",
              " '539f3a4d-0a45-4b5a-8f58-034484dfcfdb',\n",
              " 'a89e12d9-956d-4c9a-bc8d-9329fc990236',\n",
              " '47671423-9e21-4baf-bf25-aa310268e016',\n",
              " '70fc924f-6041-4bbf-8185-3ab191d15291',\n",
              " '1166db38-2c85-4e7e-a96a-65734e74ac2f',\n",
              " '6dd6b4c1-e1ea-499f-bf7e-00b99e6569ba',\n",
              " '39c33ca9-011b-4244-9659-269f845df7e4',\n",
              " '78606a22-9961-460d-90cc-67521b0dd2d3',\n",
              " '9c474bd2-485d-43b3-97f1-59edb0dd60d0',\n",
              " '15a694ea-a0a7-4ec1-ade1-f6c3595a9bf6',\n",
              " 'e8868b4f-0735-417a-96da-9d60058f364f',\n",
              " 'ae060d95-7a25-4b75-b297-ca1cb31ba336',\n",
              " '479cb70a-04de-4022-99b9-c8e7360804eb',\n",
              " 'e67d8669-0b19-4b01-bbc2-729f416ef4e8',\n",
              " 'd4f3313b-dc43-45c8-83d8-27a8ffd28e0e',\n",
              " '2329855f-0724-4167-9cd4-25df6041f000',\n",
              " 'e9e8f2ee-dbb9-4b11-968e-c5074503c8c6',\n",
              " '51c8d766-beb4-45da-9b9e-2c1fb17cf1aa',\n",
              " 'fa251aaa-edb5-475f-86dd-ca033f7ff485',\n",
              " 'a4a7fa4d-eeac-4fd5-ade1-3228a47ea42d',\n",
              " '6a649f7b-c2c6-4fa1-9cae-25b819e8b62d',\n",
              " 'e5f96593-bdb8-4e9a-85f0-67444554f5bf',\n",
              " 'e98efe9a-1dbe-4594-a285-ccf4a3486555',\n",
              " '27cea811-ded8-44ae-b94e-4563414889bd',\n",
              " '1437ff56-0670-44df-8f29-a070751a510c',\n",
              " 'a4f534fd-3e94-4633-9e56-cd676e9a22b9',\n",
              " '76d66913-eb14-4daa-82c5-cd75d3c0483e',\n",
              " '4de20d63-15e7-432b-b87b-a4f82c2566c9',\n",
              " 'bbfc14be-2d29-442a-bb4f-15caf184a242',\n",
              " 'aac353b8-9693-4d08-9762-5d3130dcb82f',\n",
              " 'e164f4f3-b549-4764-9b71-444b04f7ccf4']"
            ]
          },
          "metadata": {},
          "execution_count": 12
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "sub_docs = vectorstore.similarity_search(\"How to build a CarPlay navigation app?\")"
      ],
      "metadata": {
        "id": "gMq9jTNsnTk0"
      },
      "execution_count": 13,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "len(sub_docs)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "4UJT01c7nVFE",
        "outputId": "77ec95a1-6412-4600-e025-8f155d9eacd4"
      },
      "execution_count": 14,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "4"
            ]
          },
          "metadata": {},
          "execution_count": 14
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "for sub_doc in sub_docs:\n",
        "  print(len(sub_doc.page_content))"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "aW34WrCYErrK",
        "outputId": "d6264fed-965f-41fc-af1d-8cea951e0eaf"
      },
      "execution_count": 15,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "325\n",
            "391\n",
            "327\n",
            "351\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "retrieved_docs = retriever.get_relevant_documents(\"How to build a CarPlay navigation app?\")"
      ],
      "metadata": {
        "id": "IlOX0dgeEK8B"
      },
      "execution_count": 16,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "len(retrieved_docs)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "lwzSuAq1EMZR",
        "outputId": "053099cf-dafb-4327-9b1e-449ccea1c0e3"
      },
      "execution_count": 17,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "3"
            ]
          },
          "metadata": {},
          "execution_count": 17
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "for retrieved_doc in retrieved_docs:\n",
        "  print(len(retrieved_doc.page_content))"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "4JNhXL-5Eznl",
        "outputId": "7d7b3188-2db4-44b6-c491-ca85a9194c61"
      },
      "execution_count": 18,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "1562\n",
            "4179\n",
            "2195\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### 指定 parent_splitter\n",
        "\n",
        "这时，文档进行父子两级拆分。原始文档被 `parent_splitter` 拆分成较大的块后，再由 `child_splitter` 拆分成更小的块。"
      ],
      "metadata": {
        "id": "Vrbj17phWWd-"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "parent_splitter = RecursiveCharacterTextSplitter(chunk_size=500)\n",
        "child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)\n",
        "vectorstore = Chroma(collection_name=\"carplay_collection\", embedding_function=OpenAIEmbeddings())\n",
        "store = InMemoryStore()"
      ],
      "metadata": {
        "id": "m3o5hjswnXH9"
      },
      "execution_count": 19,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "retriever = ParentDocumentRetriever(\n",
        "    vectorstore=vectorstore,\n",
        "    docstore=store,\n",
        "    child_splitter=child_splitter,\n",
        "    parent_splitter=parent_splitter,\n",
        ")"
      ],
      "metadata": {
        "id": "GdbZi9pMnZGh"
      },
      "execution_count": 20,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "retriever.add_documents(documents=docs, ids=None)"
      ],
      "metadata": {
        "id": "8TlrImkSnanA"
      },
      "execution_count": 21,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "len(list(store.yield_keys()))"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "J_Y_pMckncLr",
        "outputId": "cde19e35-ed1d-49f3-e99b-7cda4e22188a"
      },
      "execution_count": 22,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "238"
            ]
          },
          "metadata": {},
          "execution_count": 22
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "sub_docs = vectorstore.similarity_search(\"How to build a CarPlay navigation app?\")\n",
        "\n",
        "len(sub_docs)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "5S0mCotXnd8D",
        "outputId": "19305a7c-08f5-48e5-82d8-8451e003a462"
      },
      "execution_count": 23,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "4"
            ]
          },
          "metadata": {},
          "execution_count": 23
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "for sub_doc in sub_docs:\n",
        "  print(len(sub_doc.page_content))"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "_d54jdzAFH4W",
        "outputId": "bb7b8bec-a801-4979-981a-a280e37861cc"
      },
      "execution_count": 24,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "152\n",
            "161\n",
            "126\n",
            "197\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "retrieved_docs = retriever.get_relevant_documents(\"How to build a CarPlay navigation app?\")"
      ],
      "metadata": {
        "id": "K21xftDznfuo"
      },
      "execution_count": 25,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "len(retrieved_docs)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "4NX40C6snkPx",
        "outputId": "57240dfc-ed12-4d25-afd3-6f817b5be29a"
      },
      "execution_count": 26,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "3"
            ]
          },
          "metadata": {},
          "execution_count": 26
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "for retrieved_doc in retrieved_docs:\n",
        "  print(len(retrieved_doc.page_content))"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "4RBXmOOWFLjX",
        "outputId": "385a6cb9-cfe9-4cc8-d50e-0c709cf0b2da"
      },
      "execution_count": 27,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "494\n",
            "408\n",
            "455\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "print(retrieved_docs[2].page_content)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "wHbrCz01nlzY",
        "outputId": "18ea38d9-d2b3-4998-d1d4-6ce8eb3d1d8e"
      },
      "execution_count": 28,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Build a CarPlay navigation app \n",
            "The following section describes how to create a CarPlay navigation app. \n",
            "CarPlay navigation apps have additional UI elements and capabilities that are different from \n",
            "other CarPlay app types. Skip this section if you are not creating a navigation app. \n",
            "Additional templates for navigation apps \n",
            "CarPlay navigation apps use additional templates to display map information, a keyboard, and \n",
            "voice control feedback. \n",
            "Base View\n"
          ]
        }
      ]
    }
  ]
}